COMPARISON OF SAMPLE DESIGNS
FOR  A  POPULATION  OF  FARMS 


Earl E. Houseman *


INTRODUCTION 


Data  from  an  annual  farm  census  conducted  by  the  State  of  Wisconsin  for  1970 
and  1971  were  used  to  obtain  variances  for  a  study  of  many  alternative  sampling  plans. 
Sixteen  characteristics  were  selected  for  study  with  regard  to  patterns  of  distri- 
bution over  the  State  and  the  proportion  of  farms  reporting.     "Number  of  farms 
reporting" is commonly used when referring to the farms that are producing a particular commodity or reporting a positive nonzero answer to a question. Thus, if $Y_1, \ldots, Y_N$ are the values of some characteristic Y for all farms in the population, the "number of farms reporting" is the number of farms for which $Y_i > 0$.

Wisconsin  is  a  good  State  for  purposes  of  this  study,  especially  because  of  its 
wide  variation  in  agriculture  from  north  to  south.     The  State  is  divided  into  nine 
crop  reporting  districts  (CRDs)   (fig.  1).     CRDs  are  State  subdivisions  used  for  sta- 
tistical purposes,  and  are  generally  made  up  of  homogeneous  groups  of  counties. 

Summary  data  for  1970  and  1971,  which  include  crops  grown  and  total  number  of 
farms  in  the  State,  were  derived  directly  from  the  original  data  (tables  1  and  2). 
No adjustments have been made for underenumeration or overenumeration, definitions,
or other factors. The totals and averages as shown in columns 2 and 3, for example,
are not official estimates. Numbers in parentheses in the table columns correspond
to algebraic descriptions in the appendix. A column that appears in more than one
table always has the same number. Likewise, corresponding data for two different
years also have the same column number.

Note that the proportions of farms reporting, column (7), range from less than
1 percent for potatoes and snap beans to 100 percent for farmland. In fact, the
characteristics were ordered according to the proportion reporting. Population
(number of persons living on farms) was included because its variation among farms
is low, it is reported by nearly all farms, and it is distributed geographically more
or less in proportion to numbers of farms. Some characteristics are more uniformly
distributed than others (table 3).


*The author is a statistical consultant, retired from the U.S. Department of Agriculture.

SIMPLE RANDOM SAMPLING OF FARMS


Column (4) presents standard deviations among all farms and column (5) contains
the relative variances, which may be interpreted as relative variances of a sample
mean for samples of size 1. Note the inverse relation between the proportion
reporting, column (7), and the relative variance, column (5). For a sample of a given
size, this relation points to major differences in the coefficients of variation
(c's of v) of sample estimates for characteristics, depending on the proportion of
farms reporting.

Exercise 1. Suppose a simple random sample of farms is to be selected and that
the desired c of v of estimates is 2 percent. Using the 1970 data in table 1, find
the sample sizes for farm population and snap beans so that the c of v of $\bar{y}$ will be
2 percent, where $\bar{y}$ is a simple average of the values of Y in the sample. Answer:
1,125 and 99,002. The answer, 1,125, was obtained without using the finite population
correction (fpc) since the sampling fraction is only about 1 percent. For
snap beans, the fpc must be used.

Exercise 2. With reference to exercise 1, suppose the c of v for snap beans is
set at 10 percent instead of 2 percent. How large must the sample be? Answer:
60,619, which means a sampling fraction of 60 percent. In a simple random sample of
60,619 farms, what is the expected number of growers of snap beans and what is the
expected number of farms reporting farm population? Answer: 139 and 57,485.
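
The sample-size arithmetic behind exercises 1 and 2 can be sketched as follows. This is only an illustration: the function name and the input values are assumptions, and the fpc form used is the usual one for simple random sampling, obtained by solving $(1 - n/N)V^2/n = c^2$ for n.

```python
import math


def srs_sample_size(rel_var, cv, population_size=None):
    """Sample size for a simple random sample so the c of v of the mean equals cv.

    rel_var         -- relative variance among all farms (V^2, column (5))
    cv              -- target coefficient of variation, e.g. 0.02 for 2 percent
    population_size -- N; when given, the finite population correction is applied
    """
    n0 = rel_var / cv ** 2                      # size needed when the fpc is ignored
    if population_size is None:
        return n0
    return n0 / (1.0 + n0 / population_size)    # solves (1 - n/N) * V^2 / n = cv^2


# Illustrative inputs only. 0.45 is the relative variance implied by the stated
# answer of 1,125 (1,125 * 0.02^2); it is not read from table 1.
n_population = srs_sample_size(0.45, 0.02)

# Hypothetical rare item: V^2 = 1,500 in a population of 100,000 farms.
n_rare_2pct = srs_sample_size(1500.0, 0.02, population_size=100_000)
n_rare_10pct = srs_sample_size(1500.0, 0.10, population_size=100_000)
print(round(n_population), math.ceil(n_rare_2pct), math.ceil(n_rare_10pct))
```

With a relative variance this large, the fpc keeps the required sample just below the population size at 2 percent, which is the "rare items" problem discussed next.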

The  answers  to  the  above  exercise  clearly  point  to  a  sampling  problem  often  re- 
ferred to  as  the  problem  of  sampling  for  "rare  items."     In  the  absence  of  special 
techniques  to  identify  growers  of  a  particular  commodity  prior  to  sampling,  it  might 
appear  that  a  census  is  necessary  because  the  required  sampling  fractions  are  very 
large.     However,  there  have  been  strong  tendencies  for  rare  items  to  be  incompletely 
enumerated  unless  special  precautions  are  taken.     For  example,  suppose  that  in  the 
section  of  a  questionnaire  on  crop  acreages,  separate  questions  are  asked  for  all 
leading  crops.     Then,  an  "all  other"  question  is  asked  to  get  the  names  and  acreages 
of  any  remaining  crops.     Crops  in  the  "other"  category  might  be  underenumerated  by  a 
substantial  fraction.     To  reduce  total  error,  efforts  to  reduce  response  error  may  be 
more  important  than  spending  additional  resources  on  a  complete  census. 

The mathematical relationship between columns (5) and (10) is very useful,
namely,

$$V^2 = \frac{V_r^2 + 1 - P}{P} \qquad (1)$$

where

$V^2 = \frac{\sum_i^N (Y_i - \bar{Y})^2}{\bar{Y}^2 (N-1)}$ is the relative variance among all farms, column (5),

$V_r^2 = \frac{\sum_i^{N_r} (Y_{ri} - \bar{Y}_r)^2}{\bar{Y}_r^2 (N_r - 1)}$ is the relative variance among farms reporting, column (10),

$N_r$ is the number of farms reporting, column (6),

$P = \frac{N_r}{N}$ is the proportion of farms reporting, column (7),

$\bar{Y} = \frac{\sum_i^N Y_i}{N}$ is the average for all farms, column (3), and

$\bar{Y}_r = \frac{\sum_i^{N_r} Y_{ri}}{N_r}$ is the average per farm reporting, column (8).

The subscript "r" is used in reference to a subset of farms reporting, that is, the
farms for which $Y_i > 0$.

Exercise 3. Equation 1 is exact if the population variances are defined by
dividing sums of squares by $N$ and $N_r$ instead of $N-1$ and $N_r - 1$. Show that this is
true.
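
A numerical check of the claim in exercise 3 can be sketched as below. The population is hypothetical and the helper name is an assumption; the check uses divisor-N variances, as the exercise specifies, and compares the directly computed relative variance with the right side of equation 1, $(V_r^2 + 1 - P)/P$.

```python
def relative_variance(values):
    """Relative variance with divisor N (not N-1): mean squared deviation / mean^2."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n / mean ** 2


# Hypothetical population: 6 farms report a positive value, 14 report zero.
reporting = [5.0, 8.0, 12.0, 20.0, 35.0, 40.0]
population = reporting + [0.0] * 14

P = len(reporting) / len(population)      # proportion of farms reporting
V2_r = relative_variance(reporting)       # relative variance among farms reporting
V2 = relative_variance(population)        # relative variance among all farms
V2_from_eq1 = (V2_r + 1 - P) / P          # right side of equation 1

print(V2, V2_from_eq1)                    # the two values agree up to rounding
```

This does not replace the algebraic proof the exercise asks for; it only confirms the identity on one made-up population.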

Notice in tables 1 and 2 that the range of variation in $V_r^2$, column (10), is
small compared to the variation in the values of $V^2$, column (5). Also note that
$V_r^2$ is not related in a definitive way to P. The characteristics which have the
largest values of $V_r^2$ probably have frequency distributions with a high degree of
skewness; that is, a relatively small number of farms having the largest values of
$Y_i$ probably account for a substantial part of the total of Y.

Good sampling practice would call for trying to identify (prior to sampling)
farms with extremely large values of $Y_i$ and including all (or a large fraction of)
such farms in the sample. If farms with large values of $Y_i$ are identified and put in
a separate stratum that is completely enumerated, the $V_r^2$ for the part of the
population sampled will tend to be smaller and contained within rather narrow limits. In
any event, equation 1 is often an important aid in forming prior judgments of
sampling variances and in developing techniques for approximating sampling errors
pertaining to the numerous estimates that might be produced from a sample survey.

In planning surveys, it is often helpful to have rough approximations of
sampling variances available without delay. An experienced sampler can make good guesses
at the values of $V_r^2$ and P, and from equation 1 can make a good judgment of the
magnitude of $V^2$ and hence the magnitude of the sampling error for any size sample that might
be under consideration.

For simple random sampling, ignoring the fpc, it follows from equation 1 that:

$$V^2(\bar{y}) = \frac{V_r^2 + 1 - P}{nP} \qquad (2)$$

where $V^2(\bar{y})$ is the relative variance of the sample mean, $\bar{y}$, which is an estimate
of $\bar{Y}$. One might add a factor for design efficiency to equation 2. That is, if one
judged the efficiency (variance) of the sampling plan under consideration, for
example, to be 0.6 or 1.2 times the variance for simple random sampling, one could
adjust $V^2(\bar{y})$ accordingly. Of course, one's ability to make prior judgments of sampling
error improves with experience and knowledge of information about variance. Even
for characteristics that have not been included in a previous survey, conjecture
about sampling error can provide a good indication of whether the sampling standard
error of an estimate (or class of estimates) is, for example, of the order of 7 or 8
percent or perhaps 3 or 4 percent. Past information and conjecture about sampling
error should play, or be developed to play, an important role in determining the size
of sample for a survey, the content of a questionnaire, and the extent of domain
estimation (breakdown of the data) that any given sampling plan is likely to
support satisfactorily. 1/
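
The kind of quick planning judgment described above can be sketched with equation 2 plus an optional design-efficiency factor. The function name and all input values are hypothetical guesses of the sort a sampler might make; none are taken from tables 1 or 2.

```python
import math


def planning_cv(v2_r, p, n, design_factor=1.0):
    """Approximate c of v of a sample mean from equation 2 (fpc ignored).

    v2_r          -- guessed relative variance among farms reporting
    p             -- guessed proportion of farms reporting
    n             -- number of farms in the sample
    design_factor -- variance of the plan relative to simple random sampling
    """
    rel_var = design_factor * (v2_r + 1.0 - p) / (n * p)
    return math.sqrt(rel_var)


# Hypothetical planning guesses for a sample of 1,000 farms.
cv_common = planning_cv(v2_r=1.0, p=0.5, n=1000)
cv_rare = planning_cv(v2_r=1.0, p=0.01, n=1000)
cv_rare_better_design = planning_cv(v2_r=1.0, p=0.01, n=1000, design_factor=0.6)
print(cv_common, cv_rare, cv_rare_better_design)
```

Holding $V_r^2$ fixed, the c of v grows sharply as P falls, which is why the expected number of farms reporting, nP, dominates these judgments.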

Equation 2, or a similar equation, is sometimes helpful in developing a method
for approximating sampling errors of estimates from a sample survey. This is in lieu
of computing variances for all estimates according to an exact formula for the
particular sampling design involved. It should be pointed out that for simple random
sampling, equation 2 extends easily to domain estimates. Suppose that $\bar{y}_d$ is an
estimate of the domain mean $\bar{Y}_d$, where $\bar{Y}_d$ is the domain total divided by the total number
of farms (elements) in the domain. Instead of equation 2, we have:

$$V^2(\bar{y}_d) = \frac{V_{dr}^2 + 1 - P_d}{nP_d} \qquad (3)$$

where $V^2(\bar{y}_d)$ is the relative variance of $\bar{y}_d$,

$V_{dr}^2$ is the relative variance among nonzero values of Y within the domain,

and $P_d = \frac{N_{dr}}{N}$ is the number of farms in the domain with nonzero
values of Y divided by the total number of farms in the entire population.

Thus, $nP_d$ (the expected number of nonzero values of Y in the sample and in the
domain) is a major factor determining the relative variance of $\bar{y}_d$.

Exercise 4. With reference to equation 2, nP is the expected number of farms
reporting in a sample of n farms. Study equation 2 with this in mind and with
reference to data presented in tables 1 and 2. Does it appear that most of the
differences in sampling variances for various commodities are explained by variation
in nP?

Note that $\bar{y}$ may be regarded as the product of two random variables, p and $\bar{y}_r$,
where $p = \frac{n_r}{n}$ is the proportion reporting in a random sample of n and
$\bar{y}_r = \frac{\sum_i^{n_r} y_{ri}}{n_r}$ is the average of y for the $n_r$ farms reporting. It follows from
equation 2 that:

$$V^2(\bar{y}) = V^2(p\bar{y}_r) = \frac{V_r^2}{Pn} + \frac{1-P}{Pn} \qquad (4)$$

1/ Houseman, Earl E. "The Survey as a Measurement Instrument," Agricultural Economics
Research, U.S. Dept. Agr., ERS, Vol. 24, No. 4, October 1972, p. 87.

Let $Pn = \bar{n}_r$ and $V_p^2 = \frac{1-P}{P}$. This gives:

$$V^2(\bar{y}) = \frac{V_r^2}{\bar{n}_r} + \frac{V_p^2}{n} \qquad (5)$$

which provides a basis for determining how much of the variance of $\bar{y}$ is associated
with the variance of p and how much is associated with the variance of $\bar{y}_r$. Note that
$\bar{n}_r$ is the expected value of $n_r$ for samples of n farms and $V_p^2$ is the relative variance
of p for n=1.

The comparative magnitudes of the two components of variance shown in equation 5
have implications on matters of sample design and estimation including the possibility
of double sampling, that is, using a large sample to estimate the values of p, or $N_r$,
for various characteristics, and subsamples of the large sample to estimate the values
of $\bar{y}_r$.

Exercise 5. From the results presented in table 1, determine the relative
variance of $\bar{y}$ for soybeans assuming a simple random sample of 1,000 farms. Answer:
0.0734. Find the values of the two components given in equation 5. As a check,
they should add to 0.0734, except for rounding errors.
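
The split asked for in exercise 5 can be sketched as below, taking the two components to be $V_r^2/\bar{n}_r$ (from the mean per farm reporting) and $V_p^2/n$ (from the proportion reporting), with $V_p^2 = (1-P)/P$. The function name and the inputs are hypothetical; the soybean entries of table 1 are not reproduced here.

```python
def variance_components(v2_r, p, n):
    """Split the relative variance of the sample mean into its two components."""
    expected_reporting = n * p                  # expected number of farms reporting
    v2_p = (1.0 - p) / p                        # relative variance of p for n = 1
    from_mean_per_reporting_farm = v2_r / expected_reporting
    from_proportion_reporting = v2_p / n
    return from_mean_per_reporting_farm, from_proportion_reporting


# Hypothetical inputs, not the table 1 values for soybeans.
part_mean, part_prop = variance_components(v2_r=2.0, p=0.05, n=1000)
total = part_mean + part_prop
print(part_mean, part_prop, total)
```

The sum of the two parts equals $(V_r^2 + 1 - P)/(nP)$, the relative variance of the sample mean for simple random sampling with the fpc ignored.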

Exercise 6. An estimator of $N_r$, the total number of farms reporting, is Np
where N is the total number of farms in the population and p is the proportion of
farms reporting in a sample of size n. Assuming n = 10,000, find the standard error
of $N\bar{y}$ and of Np for potatoes using the 1970 data. Answer: The standard error of $N\bar{y}$
is 10,718 acres or 26.7 percent. The standard error of Np is 82.2 farms or 11.1
percent.
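
The second part of exercise 6 can be checked with a short sketch. Two assumptions are made here: the 1970 farm total N is approximated as 69.6 farms per township times 1,462 townships (figures given later in this report; the exact total is in table 1), and the fpc is applied, which appears to be needed to land close to the stated 82.2 farms and 11.1 percent.

```python
import math


def se_number_reporting(N, N_r, n):
    """Standard error of N*p under simple random sampling, fpc included.

    Returns (standard error in farms, relative standard error).
    """
    P = N_r / N
    rel_var = (1.0 - n / N) * (1.0 - P) / (n * P)
    rel_se = math.sqrt(rel_var)
    return rel_se * N_r, rel_se


# Assumption: N approximated from the average township size, not read from table 1.
N_1970 = 69.6 * 1462
se_farms, rel_se = se_number_reporting(N_1970, N_r=741, n=10_000)
print(round(se_farms, 1), round(100 * rel_se, 1))
```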

Known  Numbers  of  Farms  Reporting 

Since  the  c's  of  v  are  large  for  characteristics  where  P  is  small,  consideration 
of  all  possible  ways  of  reducing  the  sampling  variance  for  such  characteristics  is 
important.     How  valuable  would  information  on  the  number  of  farms  reporting  be  in 
reducing  sampling  variance?     This  is  part  of  the  general  question  on  value  and  cost 
of  auxiliary  data  that  might  be  incorporated  in  a  sampling  frame. 

Suppose the numbers of farms reporting, $N_r$ in column (6), are known. Then, one
could use $N_r\bar{y}_r$ as an estimator of a population total. Assuming a simple random
sample of all farms, how does the variance of $N_r\bar{y}_r$ compare with the variance of $N\bar{y}$?

Although $\bar{y}_r = \frac{\sum_i^{n_r} y_{ri}}{n_r}$ is the ratio of random variables because $n_r$ (the number of farms
reporting in a sample of size n) is a random variable, the variance of $\bar{y}_r$ comes under
a special condition that does not require the formula for the variance of a ratio of
random variables. 2/ Among all possible samples of size n, there are samples which
have the same number, $n_r$, of farms reporting. The relative variance of $N_r\bar{y}_r$ among
such samples is $\frac{V_r^2}{n_r}$. An approximate average relative variance among all samples of
n is obtained by substituting the expected value of $n_r$ in place of its observed
value. Thus $\frac{V_r^2}{\bar{n}_r}$ is a good approximation (unless $\bar{n}_r$ is very small) of the relative
variance of $N_r\bar{y}_r$ among all samples of n. Therefore, since $\bar{n}_r = nP$, the relative
variance of $N_r\bar{y}_r$ is approximately $\frac{V_r^2}{nP}$ and since the relative variance of $N\bar{y}$ is $\frac{V^2}{n}$,
we have $\frac{V_r^2}{P}$ for comparison with $V^2$ where n = 1. Let $D = \frac{V_r^2}{PV^2}$, which shows the
variance of $N_r\bar{y}_r$ as a proportion of the variance of $N\bar{y}$. Values of D, which will be
referred to as design factors, are presented in column (11) of table 4. As an
example, .66 for alfalfa means that in 1970 the variance of $N_r\bar{y}_r$ is 66 percent of
the variance of $N\bar{y}$, or that knowledge of $N_r$ could have reduced the sampling variance
by 34 percent.
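
The design factor D can be sketched directly from $V_r^2$ and P. Substituting equation 1 for $V^2$ reduces $D = V_r^2/(PV^2)$ to $V_r^2/(V_r^2 + 1 - P)$, which is what the function below returns. The function name and the inputs are hypothetical and chosen only to show how D behaves.

```python
def design_factor_known_count(v2_r, p):
    """D = V_r^2 / (P * V^2), with V^2 taken from equation 1."""
    v2 = (v2_r + 1.0 - p) / p
    return v2_r / (p * v2)          # algebraically equal to v2_r / (v2_r + 1 - p)


# Hypothetical values: same proportion reporting, different variation among reporters.
d_uniform = design_factor_known_count(v2_r=0.2, p=0.5)
d_skewed = design_factor_known_count(v2_r=5.0, p=0.5)
print(d_uniform, d_skewed)
```

In this form D never exceeds 1, and knowing $N_r$ helps most when the farms reporting are similar to one another (small $V_r^2$) and P is small.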

Exercise 7. Suppose a list of the 741 growers of potatoes in 1970 is available
and that a simple random sample of $n_r$ growers is selected from this list. For
comparison with the answer to exercise 6, assume that $n_r = 73$, which is the expected
number of potato growers in a sample of 10,000 farms. The estimator is $N_r\bar{y}_r$, where
$N_r = 741$ and $\bar{y}_r$ is the mean of a sample of $n_r$ growers selected from the list. Find
the standard error of $N_r\bar{y}_r$. Answer: 9,717 acres or 24.1 percent.

Generally,  an  agricultural  statistics  program  must  be  based  on  a  list  frame 
(a  list  of  farms  or  farm  operators),  an  area  frame,     or  a  combination  of  the  two. 
For  a  statistically  efficient  basis  for  sampling  for  a  wide  variety  of  agricultural 
surveys,  auxiliary  data  about  sampling  units  is  important.     But  there  is  a  sub- 
stantial cost  in  obtaining  and  including  auxiliary  data  in  a  sampling  frame.  Column 
(11)  provides  some  indication  of  the  importance  of  having  information  about  which 
farms  are  producing  various  agricultural  products. 


2/ Hansen, Hurwitz, and Madow. Sample Survey Methods and Theory, John Wiley &
Sons, Inc., Vol. 1, p. 159.

Table 4--Design factors for two estimators and some other measures

                      Design factors for
                      estimator                        Ratio estimator
                      $N_r\bar{y}_r$           ------------------------------------------
Characteristic        -----------------   Design factors   Correlation   $V_X / V_Y$
                       1970      1971          1970            1970          1970
(1) 1/                 (11)      (11)          (12)            (13)          (14)
---------------------------------------------------------------------------------------
Farmland               1.00      1.00           --              --            --
Population             0.87      0.86          1.82            .31           1.27
Alfalfa                0.66      0.69          0.82            .50           0.75
All corn               0.85      0.90          0.71            .55           0.47
All pasture            0.80      0.81          0.83            .42           0.49
Milk cows              0.47      0.52          0.98            .38           0.74
Beef cattle            0.82      0.83          0.93            .27           0.21
Clover & timothy       0.73      0.46          1.00            .10           0.21
Hay for silage         0.60      0.59          0.91            .31           0.23
Cattle marketed        0.92      0.94          0.98            .20           0.067
Soybeans               0.66      0.69          0.98            .17           0.10
Peas                   0.78      0.87          0.97            .21           0.074
Sheep                  0.74      0.70          1.00            .044          0.072
Spring wheat           0.81      0.59          0.98            .22           0.039
Potatoes               0.83      0.80          0.98            .26           0.030
Snap beans             0.71      0.81          1.00            .036          0.022

-- = Not applicable.

1/ Numbers in parentheses correspond to algebraic descriptions in the appendix.


Farmland  As  An  Auxiliary  Variable 

Owing to increasing specialization in agriculture, acres in farmland have
become less effective as an auxiliary variable, except in special situations. Variance
equations for the relative variances of the ratio and mean estimators in a good form
for comparison and interpretation, and assuming n = 1, are:

$$V^2\!\left(\frac{\bar{y}}{\bar{x}}\bar{X}\right) = V_Y^2 + V_X^2 - 2\rho V_X V_Y \qquad \text{and} \qquad (6)$$

$$V^2(\bar{y}) = V_Y^2 \qquad (7)$$

where X is acres of farmland,

$V_X^2$ is the relative variance of X, which for 1970 is 0.725, the first entry
in column (5),

Y is any characteristic other than farmland,

$V_Y^2$ is the relative variance of Y and its values are found in column (5)
except for the first entry, and

$\rho$ is the correlation between X and Y.


Thus, dividing the relative variance (or variance) of the ratio estimator by the
relative variance (or variance) of the mean estimator gives:

$$D = 1 + \left(\frac{V_X}{V_Y}\right)^2 - 2\rho\frac{V_X}{V_Y} \qquad (8)$$

If  the  value  of  D  is  0.9,  for  example,  the  variance  of  the  ratio  estimator  is  10  per- 
cent less  than  the  variance  of  the  mean  estimator.     Thus,  the  value  of  D  is  an  in- 
verse measure  of  efficiency. 

Values of D, equation 8, for 1970 are listed in table 4, column (12). Values
for 1971 are not shown because they are very similar. Notice that the ratio
estimator is effective for only three characteristics: alfalfa, corn, and pasture. All
three are acreages, each is reported by a high proportion of the farms, and each
accounts for about 15 to 20 percent of the farmland. For the remaining
characteristics, the ratio estimator is ineffective.

The value of D is less than 1 when $\frac{V_X}{V_Y} < 2\rho$. To help understand the conditions
where the ratio estimator is effective, the values of $\rho$ and $\frac{V_X}{V_Y}$ are given in columns
(13) and (14).

Exercise 8. Examine equation 8 and note that when $V_X$ is small relative to $V_Y$,
the value of D will be close to 1, especially for small to moderate values of $\rho$. On
the other hand, when $V_X$ is considerably larger than $V_Y$, the potential for loss or
gain (value of D) is quite sensitive to the magnitude of the correlation. Study the
results in columns 12, 13, and 14. Note that for the characteristics at the bottom
of the list $\frac{V_X}{V_Y}$ is small and the values of D are close to 1. Compare the results
for population and hay for silage, two characteristics that have the same
correlation. Suppose $\rho$ = .8 for population, alfalfa, and potatoes. Find the values of
D, assuming the values of $\frac{V_X}{V_Y}$ are as given in table 4, column (14). Answer: 0.58,
0.36, and 0.95. What do you conclude?
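
The arithmetic of exercise 8 can be sketched as follows, assuming equation 8 has the form $D = 1 + (V_X/V_Y)^2 - 2\rho V_X/V_Y$, which is what dividing equation 6 by equation 7 gives. The function name is an assumption; the ratios are the column (14) entries of table 4 for the three characteristics named in the exercise.

```python
def ratio_design_factor(ratio_vx_vy, rho):
    """Variance of the ratio estimator as a multiple of that of the mean estimator."""
    return 1.0 + ratio_vx_vy ** 2 - 2.0 * rho * ratio_vx_vy


# V_X / V_Y from column (14) of table 4; correlation set to 0.8 as the exercise supposes.
column_14 = {"population": 1.27, "alfalfa": 0.75, "potatoes": 0.030}
answers = {name: ratio_design_factor(r, 0.8) for name, r in column_14.items()}

for name, d in answers.items():
    print(name, round(d, 2))
```

Under this form D drops below 1 exactly when $V_X/V_Y < 2\rho$, so even a high correlation gains little when $V_X/V_Y$ is very small, as for potatoes.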


CLUSTER SAMPLING

In the State farm censuses of Wisconsin, farms were identified by townships,
which are the smallest political subdivisions of the State. For most purposes, a
township is too large to be suitable as a sampling unit. However, a study of the
township as a sampling unit reveals several aspects of the general problem of
choosing sampling units and of selecting auxiliary information about them. Some
questions of interest are: How does the design efficiency of the township
(compared to the individual farm) relate to P (the proportion reporting), to the
geographic distribution of the characteristics, to the method of estimation, and to
stratification? Stratification will be considered later.


A few townships had only one or two farms. Hence, for purposes of this
analysis, townships with fewer than four farms were combined with adjacent townships,
giving a total of 1,462 individual townships and township combinations which will be
referred to simply as "townships." The average number of farms per township,
N/1,462, was 69.6 in 1970 and 67.1 in 1971.

Notation used in the specifications of alternative plans for sampling townships
will be:

$Y_{ij}$ is the value of Y for the $j^{th}$ farm in the $i^{th}$ township,

$N_i$ is the total number of farms in the $i^{th}$ township,

$Y_{ti} = \sum_j^{N_i} Y_{ij}$ is the total of Y for the $i^{th}$ township,

$N = \sum_i^M N_i$ is the total number of farms in the population,

M is the number of townships in the population,

m is the number of townships in a sample of townships,

$n_i$ is the total number of farms in the $i^{th}$ township in a sample,

$n = \sum_i^m n_i$ is the number of farms in a sample of townships,

$\bar{N} = \frac{N}{M}$ is the average number of farms per township,

$\bar{Y}_t = \frac{\sum_i^M Y_{ti}}{M}$ is the average value of Y per township,

$\bar{\bar{Y}} = \frac{\sum_i^M Y_{ti}}{N}$ is the population average per farm, and in a
sample of townships $y_{ti}$, $\bar{y}_t$, and $\bar{\bar{y}}$ correspond to $Y_{ti}$, $\bar{Y}_t$, and $\bar{\bar{Y}}$.

For the individual farm as the sampling unit, the notation will be:

$Y_i$ is the value of Y for the $i^{th}$ farm in the population,

$\bar{Y} = \frac{\sum_i^N Y_i}{N}$ is the population average per farm which is the same as $\bar{\bar{Y}}$
under the township notation, and f instead of n will be used
for the number of farms in the sample.

Other notation follows from the above definitions.

Four alternatives have been selected for comparison:

1. A simple random sample of m townships is selected from the population of M
townships. The sample townships are enumerated completely. The estimator
of the population total of Y and its relative variance are given by:

$$y_1 = \frac{M}{m}\sum_i^m y_{ti} = M\bar{y}_t \qquad (9)$$

$$V^2(y_1) = \frac{V^2(Y_{ti})}{m} \qquad (10)$$

where $V^2(Y_{ti}) = \frac{\sum_i^M (Y_{ti} - \bar{Y}_t)^2}{\bar{Y}_t^2 (M-1)}$

2. The same specifications as in the first alternative apply except that a ratio
estimator is used, the auxiliary variable being number of farms. The
estimator and its relative variance are:

$$y_2 = N\frac{\sum_i^m y_{ti}}{\sum_i^m n_i} = N\frac{\sum_i^m y_{ti}}{n} \qquad (11)$$

$$V^2(y_2) = \frac{1}{m}\{V^2(Y_{ti}) + V^2(N_i) - 2\,\mathrm{Cov}(Y_{ti}, N_i)\} \qquad (12)$$

where $V^2(N_i) = \frac{\sum_i^M (N_i - \bar{N})^2}{\bar{N}^2 (M-1)}$

and $\mathrm{Cov}(Y_{ti}, N_i) = \frac{\sum_i^M (Y_{ti} - \bar{Y}_t)(N_i - \bar{N})}{(\bar{Y}_t)(\bar{N})(M-1)}$

Exercise 9. Note that $y_2$ is simply the average per farm in the sample
multiplied by N. Is n in equation 11 a constant? Explain why the variance formula,
equation 12, is the correct one to use.


3. A random sample of m townships is selected with replacement and with
probabilities proportional to $N_i$. The estimator and its relative
variance are:

$$y_3 = \frac{N}{m}\sum_i^m \frac{y_{ti}}{N_i} \qquad (13)$$

$$V^2(y_3) = \frac{1}{m}\,\frac{\sum_i^M \frac{N_i}{N}\left(\frac{Y_{ti}}{N_i} - \bar{\bar{Y}}\right)^2}{\bar{\bar{Y}}^2} \qquad (14)$$

4. The above three plans are to be compared with a simple random sample of f
farms. The estimator and its relative variance are:

$$y_4 = N\bar{y} \qquad (15)$$

$$V^2(y_4) = \frac{1}{f}\,\frac{\sum_i^N (Y_i - \bar{Y})^2}{\bar{Y}^2 (N-1)} \qquad (16)$$


To compare the alternatives, we will assume a constant sampling fraction,
namely 1/M. This means m=1 and that f=69.6 for 1970 and f=67.1 for 1971. The
relative variances of the four estimators (table 5, columns (15), (16), (17), and
(18) respectively) are used to compare townships as sampling units to farms.
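
The comparison of the four alternatives at a common sampling fraction can be sketched on a toy population. Everything below is illustrative: the function name and the data are hypothetical, and the formulas are the usual textbook forms for the four plans (mean per township, ratio to number of farms, selection with probabilities proportional to number of farms with replacement, and simple random sampling of farms), taken here as assumptions for equations 10, 12, 14, and 16 with m = 1 and f = N/M.

```python
def township_relative_variances(townships):
    """Relative variances of the four estimators for m = 1 township.

    townships -- list of lists; each inner list holds Y for the farms of one township.
    Returns (v1, v2, v3, v4); v4 assumes f = N / M farms so that all four plans
    use the same sampling fraction.
    """
    M = len(townships)
    N_i = [len(t) for t in townships]
    Y_ti = [sum(t) for t in townships]
    N = sum(N_i)
    N_bar = N / M                      # average number of farms per township
    Y_t_bar = sum(Y_ti) / M            # average of Y per township
    Y_bar = sum(Y_ti) / N              # average of Y per farm

    v_y = sum((y - Y_t_bar) ** 2 for y in Y_ti) / (Y_t_bar ** 2 * (M - 1))
    v_n = sum((k - N_bar) ** 2 for k in N_i) / (N_bar ** 2 * (M - 1))
    cov = sum((y - Y_t_bar) * (k - N_bar) for y, k in zip(Y_ti, N_i)) / (
        Y_t_bar * N_bar * (M - 1))

    v1 = v_y                                     # mean per township estimator
    v2 = v_y + v_n - 2.0 * cov                   # ratio estimator
    v3 = sum((k / N) * (y / k - Y_bar) ** 2      # probabilities proportional to N_i
             for y, k in zip(Y_ti, N_i)) / Y_bar ** 2
    farms = [y for t in townships for y in t]
    v_farm = sum((y - Y_bar) ** 2 for y in farms) / (Y_bar ** 2 * (N - 1))
    v4 = v_farm / N_bar                          # simple random sample of N / M farms
    return v1, v2, v3, v4


# Hypothetical townships of 2, 4, and 6 farms, every farm reporting 5 units.
v1, v2, v3, v4 = township_relative_variances([[5.0] * 2, [5.0] * 4, [5.0] * 6])
print(v1, v2, v3, v4)
```

In this extreme case township totals vary only because township sizes vary, so the mean per township estimator has a positive variance while the other three have none; design factors are then ratios such as v1 / v4 whenever v4 is positive.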


Divide the variances among townships, columns (15), (16), and (17), by the
variances among farms, column (18). For 1970, these design factors are shown in
columns (19), (20), (21), and (22) of table 6. Take corn as an example. The design
factor 26.0 means that the sampling variance for the first alternative is 26 times
as large as the sampling variance for the fourth. Thus, columns (19), (20), and
(21) display the high degree of inefficiency that generally exists for a "large"
sampling unit. "Large" refers to the numbers of farms in the sampling units for
which $Y_i > 0$. Note, for example, that the average number of farmers per township
that grew potatoes was less than 1. If potato growers are widely scattered (not
concentrated in a few townships), townships are small in size with regard to potato
growers compared with characteristics at the top of the list. The wide differences
among characteristics point up the importance of making a good choice of size of
area sampling units depending on the objectives of the survey.

Exercise 10. Suppose a simple random sample of 100 townships is selected and
equation 9 is used. What is the relative variance of $y_1$ for corn? Answer:
0.01245. In a sample of 100 townships, one would expect about 6,950 farms in 1970.
Assuming a simple random sample of 6,950 farms, what is the sampling variance of
$y_4$? Answer: 0.000479. Do the two variances differ according to the design factor,
26.0, shown in column (19) of table 6?
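
The check requested at the end of exercise 10 is a single division, sketched below with the two answers quoted in the exercise. The function name is an assumption, and the second computation simply restates the same ratio on a per-unit basis.

```python
def design_factor(rel_var_plan, rel_var_srs_farms):
    """Variance of a plan expressed as a multiple of the simple random sample variance."""
    return rel_var_plan / rel_var_srs_farms


# Answers quoted in exercise 10 for corn, 1970.
d_corn = design_factor(0.01245, 0.000479)

# The same ratio per unit: 100 townships versus 6,950 farms.
per_township = 0.01245 * 100                     # relative variance for m = 1
per_farm = 0.000479 * 6950                       # relative variance for a single farm
d_corn_per_unit = per_township / (per_farm / 69.5)   # 69.5 = 6,950 / 100

print(round(d_corn, 1), round(d_corn_per_unit, 1))
```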

Exercise  11.     Study  columns  (19),   (20),  and  (21).     Prepare  a  logical  explanation 
for  the  reduction  in  the  design  factors  as  P  decreases  —  that  is,  the  loss  in 
efficiency  of  the  township  is  greatest  when  P  is  large.     Under  what  conditions  is 
this  reasonable?    What  are  the  implications  with  regard  to  a  measure  of  size  of 
sampling  units? 

It is also interesting to compare the four alternatives by using the mean per
township estimator, column (15), as a base. Thus, the 1971 variances in table 5
were divided by the 1971 variances in column (15); the results are in the right half
of table 6.

Exercise 12. Study column (24). For the ratio estimator, the design factor is
0.065 for farm population and 1.01 for potatoes. Explain this large difference.

Exercise 13. Assume that the township is the sampling unit and consider the
following two alternatives: (1) Select a sample of townships using equal
probabilities of selection. Use the ratio estimator, $y_2$, for characteristics where the design
factor in column (24) is less than 1 and the mean estimator, $y_1$, for characteristics
where the design factor is approximately 1 or larger. (2) Select the townships with
probabilities proportional to $N_i$ (number of farms) which would require using $y_3$ as the
estimator for all characteristics. Which of these two alternatives would you choose?
Why?

Exercise 14. In 1970 there were 234 growers of snap beans. Suppose there was
one grower in each of 234 townships. In this case, how would the sampling variances
for the four estimators ($y_1$, $y_2$, $y_3$, and $y_4$) compare?

Exercise 15. Suppose the 234 growers of snap beans were all located within 5
townships. How would the sampling variances for the four estimators compare? Do you
agree that the sampling variance would be very large for all of the estimators, even
for sampling fractions as large as 25 or 50 percent?

Exercise 16. What do the above analyses of the township as a sampling unit
indicate regarding selection and use of auxiliary data for incorporation in a sampling
frame? Does it appear that a substantial investment in a sampling frame, including
obtaining relevant auxiliary information, might be worthwhile and perhaps necessary
in some cases? Discuss.

STRATIFICATION—SAME  SAMPLING  FRACTIONS  APPLIED  TO  ALL  STRATA 

Simple geographic stratification, using a constant sampling fraction, can
generally be relied on to provide some reduction in sampling variances. Quite often the
reductions are small, but the cost of stratification might also be very small.
Unless one engages in a high degree of refinement, geographic stratification is
generally inexpensive and easy to apply. How effective is it?

Table 7 shows design factors for 1971 for three levels of stratification: 9 crop
reporting districts (column 27), 72 counties (column 28), and 1,462 townships (column
29). These design factors are for stratified random sampling with a constant sampling
fraction. They are the sampling variances for the three levels of stratification
expressed as a proportion of the sampling variance for a simple random sample
of farms.
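As a rough illustration of how a design factor is applied, the sketch below converts a one-farm relative variance into a relative standard error for a stratified sample of n farms. It uses the 1971 alfalfa figures from tables 7 and 12. The function name is hypothetical, and ignoring the finite population correction is an assumption made here for simplicity.

```python
import math


def relative_standard_error(rel_variance, design_factor, n):
    """Relative standard error of the mean for a stratified sample of n farms.

    rel_variance is the relative variance on a one-farm basis for simple
    random sampling. The finite population correction is ignored, which is
    an assumption of this sketch.
    """
    return math.sqrt(design_factor * rel_variance / n)


ALFALFA_REL_VAR_1971 = 1.306  # table 12, no stratification
DESIGN_FACTORS = {            # table 7, alfalfa
    "none": 1.000,
    "CRD": 0.929,
    "county": 0.898,
    "township": 0.813,
}

for level, factor in DESIGN_FACTORS.items():
    rse = relative_standard_error(ALFALFA_REL_VAR_1971, factor, 1000)
    print(f"{level:9s} {100 * rse:.2f} percent")
```

Because the design factor enters under the square root, a factor of 0.929 reduces the standard error by only about 3.6 percent.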

Exercise 17. For corn, find the relative standard error of the mean of a random
sample of 1,000 farms stratified by counties. Use the relative variance from table 1
and the design factor from table 7. Answer: 5.2 percent.

Study the results in columns (27), (28), and (29) with regard to the distributions
of the commodities by CRD (table 3). One might have anticipated that the
gains from stratification would have been greater for commodities with the most
geographic concentration. However, the last four commodities on the list are more
concentrated than those at the top, but the impact of stratification was somewhat
less. Remember, the comparisons being discussed assume a constant sampling fraction.

Be cautious about judging the impact of stratification from differences among
stratum means. For stratified random sampling, the sampling variance is an average
of within-stratum variances. In general, it is better to judge the impact of
stratification with regard to within-stratum variation. That is not easy to do when
the within-stratum variances differ widely from stratum to stratum, as in the case of
potatoes.

Table 7 -- Design factors for stratified random sampling, 1971

                              Level of stratification                     Optimum
                            (constant sampling fraction)                allocation
Characteristic                                                            by CRD
(1) 1/               None       CRD         County        Township
                                (27)        (28)          (29)            (30)

Farmland             1.000      0.981       [illegible]   [illegible]     --
Population           1.000      0.996       0.985         0.947           0.989
Alfalfa              1.000      0.929       0.898         0.813           0.908
All corn             1.000      0.934       0.893         [illegible]     0.797
All pasture          1.000      0.926       [illegible]   [illegible]     0.783
Milk cows            1.000      0.993       0.962         0.897           0.978
Beef cattle          1.000      0.981       0.968         0.935           0.837
Clover & timothy     1.000      0.826       0.780         [illegible]     0.616
Hay for silage       1.000      0.996       0.986         0.951           0.976
Cattle marketed      1.000      0.995       0.992         0.971           0.578
Soybeans             1.000      0.982       0.948         0.908           0.543
Peas                 1.000      0.997       0.991         0.982           0.489
Stock sheep          1.000      0.997       0.995         0.988           0.806
Spring wheat         1.000      0.982       0.957         0.919           0.619
Potatoes             1.000      0.998       0.983         0.958           0.472
Snap beans           1.000      0.996       0.987         0.935           0.315

Average, all
characteristics      1.000      0.969       0.950         0.896           0.714

-- = Not available.

1/ Numbers in parentheses refer to algebraic descriptions in the appendix.

Table 8 shows average alfalfa and potato acreages per farm by CRD. This
illustrates that one cannot accurately judge gains from stratification solely from
information contained in this table. For alfalfa, the relative variance among the
CRD means is less than 1 and for potatoes more than 12. But as shown by table 7,
the gain from stratification is less for potatoes. Note the gains from optimum
allocation, column (30).

Table  9  illustrates  in  another  way  the  point  about  between-stratum  variances  as 
a  basis  for  judging  stratification.     The  simple  analysis  of  variance  table  is  used 
to  display  between-  and  within-stratum  variances  for  alfalfa  with  stratification  by 
CRD  and  township.     Note  the  size  of  the  mean  square  among  CRDs  compared  to  the  mean 
square  among  townships.     Defining  strata  with  the  idea  of  maximizing  the  variance 
among  them  can  be  misleading,  especially  if  their  number  is  not  fixed. 

Before considering alternative allocations of a sample to CRDs, it is interesting
to examine the impact of stratification when the township is the sampling
unit. For comparing the township with the individual farm as sampling units, the
variances in columns (16) and (32), within-State and within-CRD variances, respectively
(table 10), must be multiplied by 67.14, the average number of farms per township, to
convert the variances among townships to a basis of one farm. Multiplying column (16)
by 67.14 and dividing by column (5) gives column (33), the design factors for the
township compared to the farm when there is no stratification. To illustrate, the
design factor for "population" is (67.14)(0.038)/0.476 = 5.36. Column (34) is derived
from columns (31) and (32) in the same way. The design factors in column (34) are
somewhat less than those in column (33). In other words, the loss in efficiency when
the township is the sampling unit is not as great when stratification is applied.
One might also say that gains from stratification are somewhat greater for the
township than for the individual farm. From another view, one might say that
stratification is more important when the sampling units are large.
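The conversion to a one-farm basis can be sketched in a few lines. The values 0.038 and 0.476 are the "population" figures quoted in the text. The count of 98,156 farms is inferred here from the 98,155 total degrees of freedom in table 9, so treat it as an assumption, and the function name is hypothetical.

```python
FARMS_1971 = 98_156      # inferred: total degrees of freedom in table 9, plus one
TOWNSHIPS = 1_462        # number of townships given in the text

farms_per_township = FARMS_1971 / TOWNSHIPS


def township_design_factor(var_among_townships, var_among_farms, avg_farms):
    """Column (33): township variance put on a one-farm basis,
    divided by the variance among farms."""
    return avg_farms * var_among_townships / var_among_farms


# "Population" example from the text: columns (16) and (5).
factor = township_design_factor(0.038, 0.476, farms_per_township)
print(f"average farms per township: {farms_per_township:.2f}")
print(f"design factor: {factor:.2f}")
```

The same function applied to columns (32) and (31) would give column (34).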


Table 8 -- Means per farm by CRD, alfalfa and potatoes, 1971

                     Average acreage per farm
CRD                  Alfalfa        Potatoes

1                    24.0           0.14
2                    10.9           0.66
3                    25.1           0.99
4                    36.5           0.02
5                    24.0           2.15
6                    39.0           0.07
7                    42.2           0.09
8                    32.1           0.10
9                    30.[?]         0.86

State average        30.7           0.43

Table 9 -- Analysis of variance, alfalfa, 1971

Source of             Degrees of      Sum of            Mean
variation             freedom         squares           square

Total                 98,155          121,050,012       1,233

Among CRDs            8               8,615,115         1,076,889
Within CRDs           98,147          112,434,897       1,146

Among townships       1,461           24,066,000        16,472 1/
Within townships      96,694          96,984,000        1,003

1/ Derived. The within-township mean square 1,003 was available on the
computer printout but the sums of squares for townships were not. Although
these numbers were derived, they are accurate to at least three digits.
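The link between table 9 and the alfalfa design factors in table 7 can be checked with a short calculation. This is only a sketch: it recomputes each mean square as a sum of squares over its degrees of freedom and then takes the ratio of the within-stratum mean square to the total mean square. The variable names are illustrative.

```python
# (degrees of freedom, sum of squares) from table 9
anova = {
    "total": (98_155, 121_050_012),
    "among CRDs": (8, 8_615_115),
    "within CRDs": (98_147, 112_434_897),
    "among townships": (1_461, 24_066_000),
    "within townships": (96_694, 96_984_000),
}

mean_square = {source: ss / df for source, (df, ss) in anova.items()}

# Design factor = within-stratum mean square / total mean square.
factor_crd = mean_square["within CRDs"] / mean_square["total"]
factor_township = mean_square["within townships"] / mean_square["total"]

for source, ms in mean_square.items():
    print(f"{source:17s} {ms:12,.0f}")
print(f"design factor, CRD strata:      {factor_crd:.3f}")
print(f"design factor, township strata: {factor_township:.3f}")
```

The two ratios agree with the alfalfa entries in columns (27) and (29) of table 7, even though the mean square among CRDs is far larger than the mean square among townships.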


[Table 10 was printed sideways in the original and is not legible in this copy. According to the text, it includes the within-State and within-CRD variances among townships, columns (16) and (32), and the design factors in columns (33) and (34).]


Exercise 18. Column (27) in table 7 equals the entries in column (31) divided
by the entries in column (5). It shows the effectiveness of stratification by CRD
when the farm is the sampling unit. With reference to table 10, divide column (32)
by column (16), which gives corresponding design factors for the township. Compare
the results with column (27). What does this comparison show regarding the
effectiveness of stratification?

Exercise 19. What is the difference between column (ZZ) in table 6 and column
(20) in table 6?


STRATIFICATION— ALLOCATION  OF  SAMPLE  TO  CRDs 

To illustrate the impact on sampling variance of alternate allocations of a
sample to strata, CRDs will be used as strata. Three characteristics have been
selected for this purpose: alfalfa, beef cattle, and potatoes. Alfalfa and potatoes
represent two widely different geographic patterns of production. Beef cattle falls
between the two.

For stratified random sampling, general formulas for the estimator of a
population total and its relative variance are:

$$y = \sum_h N_h \bar{y}_h \tag{17}$$

$$V^2(y) = \frac{1}{N^2 \bar{Y}^2} \sum_h \frac{N_h^2 S_h^2}{n_h} \tag{18}$$

where $N_h$ is the population number of farms in stratum h,

$N = \sum_h N_h$ is the total number of farms in the population,

$\bar{y}_h$ is the sample mean for stratum h,

$n_h$ is the size of the sample in stratum h,

$S_h^2 = \dfrac{\sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2}{N_h - 1}$ is the variance within stratum h,

$Y_{hi}$ is the value of Y for the $i$th farm in stratum h,

$\bar{Y}_h = \dfrac{\sum_{i=1}^{N_h} Y_{hi}}{N_h}$ is the population mean for stratum h, and

$\bar{Y}$ is the overall population mean.

Equation 18 gives the relative variance of $y$ for a sample of size $n$, where
$n = \sum_h n_h$. For comparison with previous results, the relative variance of $y$ should be
expressed on the basis of a hypothetical sample of one farm. This is accomplished
by multiplying the right side of equation 18 by $n$. Thus,

$$V^2(y) = \frac{n}{N^2 \bar{Y}^2} \sum_h \frac{N_h^2 S_h^2}{n_h} \tag{19}$$

is a general expression for the relative variance of $y$ expressed on the basis of a
sample of one farm. Equation 19 will be used to find the relative variance of $y$
for alternate allocations.
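A minimal sketch of equation 19 follows, using the 1970 alfalfa data from tables 11 and 12. It assumes the sample is allocated in proportion to a chosen set of weights, in which case the total sample size n cancels out of the formula. The function and variable names are illustrative, and the published standard deviations are rounded to one decimal, so the results agree with table 12 only to about three digits.

```python
# Table 11 (1970): farms per CRD and total alfalfa acres per CRD.
N_h = [10748, 11166, 5917, 15342, 9616, 15164, 13654, 14315, 5763]
Y_h = [236900, 106157, 149872, 531718, 212840, 543090, 548712, 420207, 157222]
# Table 12, column (41): within-CRD standard deviations of alfalfa acres, 1970.
S_h = [33.2, 19.6, 29.5, 34.0, 28.7, 32.6, 33.8, 30.3, 32.5]


def relative_variance(weights):
    """Equation 19 when n_h is proportional to `weights` (n cancels out)."""
    N = sum(N_h)
    ybar = sum(Y_h) / N
    total_weight = sum(weights)
    share = [w / total_weight for w in weights]  # n_h / n
    total = sum(Nh**2 * Sh**2 / a for Nh, Sh, a in zip(N_h, S_h, share))
    return total / (N**2 * ybar**2)


v_prop_n = relative_variance(N_h)                                    # column (35)
v_prop_y = relative_variance(Y_h)                                    # column (37)
v_optimum = relative_variance([Nh * Sh for Nh, Sh in zip(N_h, S_h)])  # column (39)

print(f"proportionate to N_h: {v_prop_n:.3f}")
print(f"proportionate to Y_h: {v_prop_y:.3f}")
print(f"optimum (N_h S_h):    {v_optimum:.3f}")
```

These three values can be compared with the 1970 entries 1.178, 1.228, and 1.156 in table 12.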
Exercise 20. For beef cattle and 1970, verify the sample allocations shown in
columns (36), (37), and (38) of table 13. Use the within-stratum standard deviations
shown in column (41) of table 13 and $N_h$ and $Y_h$ shown in table 11.

Exercise 21. Using equation 19, verify the following two relative variances for
alfalfa in table 12: 1.261, the relative variance for 1971 when the sample is
allocated in proportion to $Y_h$ in 1971; and 1.192, the relative variance for 1971 when the
sample is allocated according to the 1970 optimum.

You  probably  know  that  sample  allocations  which  differ  by  a  small  amount  from 
optimum  will  result  in  a  small  or  negligible  increase  in  variance.     This  is  fortunate 
because  in  practice,  optimum  allocation  can  at  best  only  be  approximated.  Moreover, 
the  optimum  allocation  varies  among  the  characteristics  included  in  a  survey. 

When the within-stratum standard deviations, $S_h$, are equal, the optimum
allocation is the same as allocating the sample in proportion to $N_h$. It follows that when
the standard deviations, $S_h$, are moderately different, the variance for optimum
allocation will be only slightly less than the variance for an allocation
proportionate to $N_h$. And, assuming small variation in the unknown values of $S_h$, estimates of
$S_h$ must be precise, or an effort to reduce variance by optimizing the allocation
could result in an increase in variance. This suggests that rather large differences
in the $S_h$ might be necessary before optimizing the allocation is worthwhile.
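The size of the possible gain can be made explicit. The following is a sketch derived from equation 19, not a formula stated in the original text; the symbols $W_h = N_h/N$ and $\bar{S} = \sum_h W_h S_h$ are introduced here for convenience.

```latex
% Proportionate allocation: n_h = n N_h / N
V^2_{\text{prop}}(y) = \frac{1}{\bar{Y}^2} \sum_h W_h S_h^2

% Optimum allocation: n_h = n N_h S_h / \sum_h N_h S_h
V^2_{\text{opt}}(y) = \frac{1}{\bar{Y}^2} \Bigl( \sum_h W_h S_h \Bigr)^2
                    = \frac{\bar{S}^2}{\bar{Y}^2}

% Gain from optimizing the allocation
V^2_{\text{prop}}(y) - V^2_{\text{opt}}(y)
    = \frac{1}{\bar{Y}^2} \sum_h W_h \bigl( S_h - \bar{S} \bigr)^2
```

The gain is the weighted variance of the $S_h$ about their weighted mean, divided by $\bar{Y}^2$. It is zero when all the $S_h$ are equal and stays small unless the $S_h$ differ considerably.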


Turn to tables 12, 13, and 14 and study the sample allocations in relation to
the variances and design factors presented at the bottom of each table. Reference
to table 11 may be helpful in understanding or interpreting the results. In
particular, note the wide variation in $S_h$ for potatoes. The largest $S_h$ in 1971 was about
33 times larger than the smallest, and the reduction in variance attributable to
optimum allocation was substantial. Perhaps of greater importance, from a practical
point of view, is the fact that when the 1970 optimum allocation was used in 1971,
the design factor was 0.502 compared to 0.472 for the 1971 optimum. That is, the
1970 optimum allocation was nearly as effective in 1971 as the 1971 optimum.
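The comparison for potatoes can be approximated from the published tables. The sketch below rests on several assumptions: the 1971 stratum shares $N_h/N$ are taken from the rounded allocation in column (36) of table 14, the 1971 State mean of 0.43 acres comes from table 8, and the standard deviations are the rounded 1971 values in table 14. With that much rounding, the results should match the table only approximately.

```python
# Table 14, column (36): allocation proportionate to N_h in 1971, used here
# as an approximation to the stratum shares N_h / N (an assumption).
W_h = [s / 1000 for s in (105, 111, 61, 150, 90, 150, 138, 138, 57)]
# Table 14, column (41): within-CRD standard deviations, potatoes, 1971.
S_h = [5.6, 13.9, 13.0, 1.0, 32.8, 2.2, 5.4, 4.0, 13.4]
# Table 14, column (39): optimum allocation based on 1970 data.
ALLOC_1970_OPTIMUM = [77, 155, 140, 32, 329, 71, 60, 63, 73]

YBAR_1971 = 0.43        # table 8, State average acres of potatoes per farm
REL_VAR_SRS_1971 = 809  # table 14, no stratification


def relative_variance(allocation):
    """Equation 19 written with shares W_h = N_h / N; n cancels out."""
    total = sum(allocation)
    return sum((w * s) ** 2 / (a / total)
               for w, s, a in zip(W_h, S_h, allocation)) / YBAR_1971**2


v_1971_optimum = relative_variance([w * s for w, s in zip(W_h, S_h)])
v_1970_carried = relative_variance(ALLOC_1970_OPTIMUM)

print(f"1971 optimum:        {v_1971_optimum:.0f}, "
      f"design factor {v_1971_optimum / REL_VAR_SRS_1971:.3f}")
print(f"1970 optimum reused: {v_1970_carried:.0f}, "
      f"design factor {v_1970_carried / REL_VAR_SRS_1971:.3f}")
```

The two design factors come out close to the 0.472 and 0.502 quoted in the text, which illustrates how little is lost by reusing the previous year's allocation.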

Exercise 22. Due to an interest in allocating the sample to minimize the
sampling variance for potatoes, suppose a proposal has been made to allocate the sample
according to the 1970 optimum allocation for potatoes. What would the relative
variances in 1971 be for alfalfa and beef cattle?

Usually some prior information about the values (or relative values) of $Y_h$ is
available. Suppose accurate estimates of the stratum totals, $Y_h$, exist for a
previous year, but no estimates of the $S_h$ are available. A sample could be allocated
in proportion to the estimates of $Y_h$ or in proportion to some function of the
estimates of $Y_h$, such as the square root of the estimates. If $Y_h$ is approximately in
proportion to $N_h S_h$, a sample allocated in proportion to $Y_h$ will be close to optimum.

Exercise 23. Show, algebraically, that the optimum size of sample from stratum
h is proportional to the stratum total $Y_h$ when $S_h / \bar{Y}_h$ is constant.
For the three commodities, examine the differences between the allocations in
proportion to $Y_h$ in 1970 and the optimum allocations. Generally, strata which have
the largest values of $\bar{Y}_h$ will have the smallest coefficients of variation, $S_h / \bar{Y}_h$, which
means (compared to optimum) that allocating a sample in proportion to $Y_h$ will
allocate too much of the sample to strata with large values of $Y_h$ and not enough to
strata with relatively small values of $Y_h$. This phenomenon is apparent in tables 12,
13, and 14. With experience, and in the absence of estimates of $S_h$, one might decide
to allocate the sample in proportion to estimates of $Y_h$ and then arbitrarily increase
the sample by 50 percent or more for strata having the smallest values of $Y_h$.
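The pattern described above can be examined for beef cattle with a short calculation from tables 11 and 13. This is an illustrative sketch: the helper name is hypothetical, and the 1970 standard deviation for CRD 5 (16.1) is partly illegible in the source and is reconstructed here from the published optimum allocation, so treat it as an assumption.

```python
# Table 11 (1970): farms per CRD and number of beef cattle per CRD.
N_h = [10748, 11166, 5917, 15342, 9616, 15164, 13654, 14315, 5763]
Y_h = [48907, 26009, 16433, 106283, 39790, 36839, 167775, 128472, 25451]
# Table 13, column (41), 1970. The CRD 5 value is a reconstruction.
S_h = [14.3, 9.6, 12.8, 21.4, 16.1, 11.5, 37.8, 35.0, 19.9]


def allocate(weights, n=1000):
    """Split a sample of n farms among strata in proportion to weights."""
    total = sum(weights)
    return [n * w / total for w in weights]


by_total = allocate(Y_h)                                     # column (37)
optimum = allocate([Nh * Sh for Nh, Sh in zip(N_h, S_h)])    # column (39)

print("CRD   prop. to Y_h   optimum   ratio")
for crd, (a, b) in enumerate(zip(by_total, optimum), start=1):
    print(f"{crd:3d}   {a:12.0f}   {b:7.0f}   {a / b:5.2f}")
```

A ratio above 1 marks a stratum that receives more than its optimum share when the sample follows $Y_h$. CRDs 4 and 7, which have large cattle totals, fall in that group, while CRDs 3 and 9, which have the smallest totals, receive well under their optimum share.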

Table 11 -- Stratum (crop reporting district) totals for 1970

                       Alfalfa                   Beef cattle               Potatoes
         Farms,    Farms       Total acres,  Farms       Cattle,       Farms       Total acres,
CRD      N_h       reporting   Y_h           reporting   Y_h           reporting   Y_h

                                             Number
1        10,748    5,502       236,900       3,012       48,907        27          1,737
2        11,166    3,462       106,157       2,189       26,009        87          5,524
3        5,917     3,780       149,872       1,071       16,433        134         7,826
4        15,342    12,146      531,718       4,552       106,283       26          698
5        9,616     5,853       212,840       2,193       39,790        245         16,820
6        15,164    12,893      543,090       2,959       36,839        43          1,213
7        13,654    12,115      548,712       4,879       167,775       10          642
8        14,315    11,558      420,207       4,767       128,472       31          1,032
9        5,763     4,125       157,222       1,273       25,451        138         4,587

Total    101,685   71,434      2,906,718     26,859      595,961       741         40,079
Table 12 -- Alternative sample allocations and sampling variances, alfalfa

                     Allocations of a sample of 1,000 farms
         Number of farms,    Item total,        Optimum,           Standard
         N_h                 Y_h                N_h S_h            deviation, S_h
CRD      1970      1971      1970      1971     1970      1971     1970      1971
         (35)      (36)      (37)      (38)     (39)      (40)     (41)      (41)

1        106       105       81        82       114       110      33.2      34.9
2        110       111       36        39       70        72       19.6      21.7
3        58        61        52        49       56        55       29.5      30.5
4        151       150       183       179      168       166      34.0      37.1
5        94        90        73        70       88        84       28.7      31.0
6        149       150       187       190      158       154      32.6      34.5
7        134       138       189       190      147       146      33.8      35.3
8        141       138       145       144      139       137      30.3      33.3
9        57        57        54        57       60        76       32.5      44.1

Total    1,000     1,000     1,000     1,000    1,000     1,000    32.3      35.1

                                  Relative variances 1/            Design factors
Allocation of sample              1970             1971            1970      1971

No stratification                 1.277 2/         1.306 3/        1.000     1.000
Proportionate to N_h              1.178 (35) 4/    1.213 (36)      .922      .929
Proportionate to Y_h              1.228 (37)       1.261 (38)      .962      .966
Optimum                           1.156 (39)       1.186 (40)      .905      .908
According to Y_h in 1970          --               1.273 (37)      --        .975
According to N_h S_h in 1970      --               1.192 (39)      --        .913

-- = Not available.

1/ Assumes n = 1 rather than n = 1,000.
2/ From table 1.
3/ From table 2.
4/ Numbers in parentheses refer to the column numbers of the sample allocations
corresponding to the relative variances.
Table 13 -- Alternative sample allocations and sampling variances, beef cattle

                     Allocations of a sample of 1,000 farms
         Number of farms,    Item total,        Optimum,           Standard
         N_h                 Y_h                N_h S_h            deviation, S_h
CRD      1970      1971      1970      1971     1970      1971     1970      1971
         (35)      (36)      (37)      (38)     (39)      (40)     (41)      (41)

1        106       105       82        74       73        66       14.3      14.2
2        110       111       44        52       50        52       9.6       10.7
3        58        61        27        32       36        36       12.8      13.4
4        151       150       178       185      155       150      21.4      22.8
5        94        90        67        70       73        92       16.1      23.2
6        149       150       62        72       82        99       11.5      15.1
7        134       138       281       286      242       232      37.8      38.4
8        141       138       216       186      235       201      35.0      33.5
9        57        57        43        43       54        72       19.9      28.5

Total    1,000     1,000     1,000     1,000    1,000     1,000    23.5      24.9

                                  Relative variances 1/            Design factors
Allocation of sample              1970             1971            1970      1971

No stratification                 16.07 2/         15.06 3/        1.000     1.000
Proportionate to N_h              15.74 (35) 4/    14.77 (36)      .979      .981
Proportionate to Y_h              13.02 (37)       13.30 (38)      .810      .883
Optimum, N_h S_h                  12.71 (39)       12.61 (40)      .791      .837
According to Y_h in 1970          --               13.52 (37)      --        .898
According to N_h S_h in 1970      --               12.87 (39)      --        .855

-- = Not available.

1/ Assumes n = 1 rather than n = 1,000.
2/ From table 1.
3/ From table 2.
4/ Numbers in parentheses refer to the column numbers of the sample allocations
corresponding to the relative variances.
Table 14 -- Alternative sample allocations and sampling variances, potatoes

                     Allocations of a sample of 1,000 farms
         Number of farms,    Item total,        Optimum,           Standard
         N_h                 Y_h                N_h S_h            deviation, S_h
CRD      1970      1971      1970      1971     1970      1971     1970      1971
         (35)      (36)      (37)      (38)     (39)      (40)     (41)      (41)

1        106       105       43        34       77        70       5.8       5.6
2        110       111       138       168      155       183      11.2      13.9
3        58        61        195       139      140       94       19.1      13.0
4        151       150       17        10       32        17       1.7       1.0
5        94        90        421       451      329       352      27.6      32.8
6        149       150       30        23       71        39       3.8       2.2
7        134       138       16        29       60        89       3.6       5.4
8        141       138       26        32       63        65       3.6       4.0
9        57        57        114       114      73        91       10.2      13.4

Total    1,000     1,000     1,000     1,000    1,000     1,000    11.1      12.3

                                  Relative variances 1/            Design factors
Allocation of sample              1970             1971            1970      1971

No stratification                 789 2/           809 3/          1.000     1.000
Proportionate to N_h              787 (35) 4/      807 (36)        .997      .998
Proportionate to Y_h              535 (37)         480 (38)        .678      .593
Optimum, N_h S_h                  405 (39)         382 (40)        .513      .472
According to Y_h in 1970          --               570 (37)        --        .704
According to N_h S_h in 1970      --               406 (39)        --        .502

-- = Not available.

1/ Assumes n = 1 rather than n = 1,000.
2/ From table 1.
3/ From table 2.
4/ Numbers in parentheses refer to the column numbers of the sample allocations
corresponding to the relative variances.
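APPENDIX

Table column
number        Description

(1)   Characteristics

(2)   $\sum_{i=1}^{N} Y_i$, where $Y_i$ is the value of characteristic Y for the $i$th farm in
      the population of N farms.

(3)   $\dfrac{\sum_{i=1}^{N} Y_i}{N} = \bar{Y}$

(4)   $\dfrac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}$

(5)   $\dfrac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{\bar{Y}^2 (N - 1)} = V^2$

(6)   $N_r$, the number of farms in the population with $Y_i > 0$. All farms
      have some farmland, so $N_r$ for farmland is equal to N.

(7)   $\dfrac{N_r}{N} = P$

(8)   $\dfrac{\sum_{i=1}^{N_r} Y_{ri}}{N_r} = \bar{Y}_r$. The subscript r is used to designate a subset of farms
      with $Y_i > 0$. In an expression like $\sum Y_{ri}$, i is an index of farms in
      the subset. $\sum_{i=1}^{N_r} Y_{ri} = \sum_{i=1}^{N} Y_i$.

(9)   $\dfrac{\sum_{i=1}^{N_r} (Y_{ri} - \bar{Y}_r)^2}{N_r - 1}$

(10)  $\dfrac{\sum_{i=1}^{N_r} (Y_{ri} - \bar{Y}_r)^2}{\bar{Y}_r^2 (N_r - 1)} = V_r^2$

(11)  PV (subscripts and exponents are illegible in the source)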
(12)  $1 + \dfrac{V_x^2}{V_y^2} - 2\rho \dfrac{V_x}{V_y}$, where $V_x^2$ is the relative variance of farmland and
      $V_y^2$ is the relative variance for any other characteristic. $\rho$ is the
      correlation between x and y.

(13)  $\rho = \dfrac{\sum (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum (Y_i - \bar{Y})^2 \sum (X_i - \bar{X})^2}}$, where X is acres of farmland and Y is
      any other characteristic.

(14)  [illegible in the source]

(15)  $\dfrac{\sum_{i=1}^{M} (Y_{ti} - \bar{Y}_t)^2}{\bar{Y}_t^2 (M - 1)}$, where $\bar{Y}_t = \dfrac{\sum_{i=1}^{M} Y_{ti}}{M}$.

      The subscript t indicates that the unit is a township. There are M
      townships in the population, $Y_{ti}$ is the total of Y for the $i$th township, and
      $\bar{Y}_t$ is the average value of Y per township. This column is the relative
      variance among the M values of $Y_{ti}$. See equation 10.

(16)  $\dfrac{\sum_{i=1}^{M} (Y_{ti} - R N_i)^2}{\bar{Y}_t^2 (M - 1)}$, where $R = \dfrac{\sum_{i=1}^{M} Y_{ti}}{\sum_{i=1}^{M} N_i}$ and $N_i$ is the number
      of farms in the $i$th township. Column (16) is the relative variance
      of the ratio $Y_{ti} / N_i$. Relates to equation 12.

(17)  $\dfrac{\sum_{i=1}^{M} N_i \left( \dfrac{Y_{ti}}{N_i} - \bar{Y} \right)^2}{\bar{Y}^2 N}$

      This column is the relative variance among townships when selected
      with probabilities proportionate to $N_i$. See equation 14.
(18)  $\dfrac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{\bar{Y}^2 (N - 1) \bar{N}}$, where $\bar{N} = \dfrac{N}{M}$.

      This is the relative variance of $y$ for a random sample of $n = \bar{N}$ farms.

(19)  Column (15) divided by column (18)
(20)  Column (16) divided by column (18)
(21)  Column (17) divided by column (18)
(22)  Column (18) divided by column (18)
(23)  Column (15) divided by column (15)
(24)  Column (16) divided by column (15)
(25)  Column (17) divided by column (15)
(26)  Column (18) divided by column (15)

      Stratum
(27)  CRD
(28)  County
(29)  Township

      $\dfrac{\dfrac{1}{N} \sum_h N_h S_h^2}{\dfrac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}}$

      Columns (27), (28), and (29) are average within-stratum variances divided
      by the overall variance.

      h is the index for strata,
      $N_h$ is the number of farms in stratum h,
      $Y_{hi}$ is the value of Y for the $i$th farm in stratum h, and
      $\bar{Y}_h$ is the average value of Y in stratum h.

(30)  $\dfrac{\dfrac{n}{N^2} \sum_h \dfrac{N_h^2 S_h^2}{n_h}}{\dfrac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}}$

      where $n_h$ is the size of sample from stratum h,
      $n = \sum_h n_h$, and
      $S_h^2 = \dfrac{\sum_i (Y_{hi} - \bar{Y}_h)^2}{N_h - 1}$.

      The quantity $\sum_h \dfrac{N_h^2 S_h^2}{n_h}$ is the variance of an estimate of the population
      total for a sample of size n. The factor $\dfrac{n}{N^2}$ changes this variance to
      the variance of a mean of a stratified random sample assuming a
      hypothetical sample of n = 1. For column (30), optimum allocation of n to
      CRDs is used to determine the $n_h$.

(31)  This is the average within-CRD variance among farms divided by $\bar{Y}^2$. It
      is the relative variance of a stratified random sample with allocation
      proportionate to $N_h$, assuming a hypothetical sample of n = 1.

(32)  [formula illegible in the source], where $S_{th}^2 = \dfrac{\sum_i (Y_{thi} - R N_{hi})^2}{M_h - 1}$,

      $Y_{thi}$ is the total of Y for the $i$th township in stratum h, and
      $N_{hi}$ is the number of farms in the $i$th township in stratum h.

      This column is the average within-CRD variance among townships for
      the combined ratio estimator, divided by $\bar{Y}_t^2$.

(33)  $\dfrac{\bar{N} \text{ times column (16)}}{\text{column (5)}}$, where $\bar{N}$ is the average number of farms per township.

(34)  $\dfrac{\bar{N} \text{ times column (32)}}{\text{column (31)}}$

(35) through (40)  These columns show alternative allocations of a
      sample of 1,000 farms to CRDs.

(41)  $S_h$, standard deviations of Y within CRDs.